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Abstract 

We introduce the “NoBackTrack” algorithm to train the parameters 
of dynamical systems such as recurrent neural networks. This algorithm 
works in an online, memoryless setting, thus requiring no backpropaga- 
tion through time, and is scalable, avoiding the large computational 
and memory cost of maintaining the full gradient of the current state 
with respect to the parameters. 

The algorithm essentially maintains, at each time, a single search 
direction in parameter space. The evolution of this search direction 
is partly stochastic and is constructed in such a way to provide, at 
every time, an unbiased random estimate of the gradient of the loss 
function with respect to the parameters. Because the gradient estimate 
is unbiased, on average over time the parameter is updated as it should. 

The resulting gradient estimate can then be fed to a lightweight 
Kalman-like filter to yield an improved algorithm. For recurrent neural 
networks, the resulting algorithms scale linearly with the number of 
parameters. 

Small-scale experiments confirm the suitability of the approach, 
showing that the stochastic approximation of the gradient introduced 
in the algorithm is not detrimental to learning. In particular, the 
Kalman-like version of NoBackTrack is superior to backpropagation 
through time (BPTT) when the time span of dependencies in the data 
is longer than the truncation span for BPTT. 

Consider the problem of training the parameters 0 of a dynamical system 
over a variable h G M” subjected to the evolution equation 

Kt + '^) = ( 1 ) 

where / is a fixed function of h and of an input signal x{t), depending on pa¬ 
rameters 9. The goal is online minimization of a loss function uit)) 

between a desired output y{t) at time t and a prediction^ 

y{t) = (2) 

'^The prediction y may not live in the same set as y. Often, y encodes a probability 
distribution over the possible values of y, and the loss is the logarithmic loss i = — \ogPy{y). 
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computed from h{t) and additional parameters (p. 

A typical example we have in mind is a recurrent neural network, 
with activities ai{t) := sigm(hj(t)) and evolution equation hi{t + 1) = 
bi + Efc rkiXk{t) + Y.j Wjiaj{t), with parameter 6 = {hi, rki, Wji)ij^k- 

If the full target sequence y{t)t^[o-T] is known in advance, one strategy is 
to use the backpropagation through time algorithm (BPTT, see e.g. [Jae02]) 
to compute the gradient of the total loss Lt ■= J2j=o with respect to the 
parameters 6 and ip, and use gradient descent on 9 and (p. 

However, if the data y{t + 1) arrive one at a time in a streaming fash¬ 
ion, backpropagation through time would require making a full backward 
computation from time t -|- 1 to time 0 after each new data point becomes 
available. This results in an complexity and in the necessity to store 

past states, inputs, and outputs. A possible strategy is to only backtrack by 
a finite number of time steps [Jae02] rather than going back all the way to 
t = 0. But this provides biased gradient estimates and may impair detection 
of time dependencies with a longer range than the backtracking time range. 

By contrast, methods which are fully online are typically not scalable. 
One strategy, known as real-time recurrent learning (RTRL) in the recurrent 
network community,^ maintains the full gradient of the current state with 
respect to the parameters: 

dh{t) 

de 

which satisfies the evolution equation 



G{t -\- 1) 


df{h{t),x{t),e) 

dh 


G{t) + 


df{h{t),x{t),e) 

do 


( 4 ) 


(by differentiating (1)). Knowing G{t) allows to minimize the loss via a 
stochastic gradient descent on the parameters 6, namely,^ 


0 ^ 0 - ( 5 ) 

with learning rate yt- Indeed, the latter quantity can be computed from Gt 
and from the way the predictions depend on h{t), via the chain rule 


% _ det{Y{h{t),ip),y{t)) 
de dh ^ ^ 


( 6 ) 


However, the full gradient G{t) is an object of dimension dim h x 
dim0. This prevents computing or even storing G{t) for moderately large¬ 
dimensional dynamical systems, such as recurrent neural networks. 

^This amounts to applying forward automatic differentiation. 

^We use the standard convention for Jacobian matrices, namely, dxjdy is the matrix 
with entries dxi / dyj. Then the chain rule writes f| ff = ff ■ This makes the derivatives 
dlt /do into row vectors so that gradient descent is 0 <— 0 — {dit /d0)h 
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Algorithms using a Kalman filter on 6 also'^ rely on this derivative ^ 
(see [Hay04, Jae02] for the case of recurrent networks). So any efficient way 
of estimating this derivative can be fed, in turn, to a Kalman-type algorithm. 

Algorithms suggested to train hidden Markov models online (e.g., [Capll], 
based on expectation-maximization instead of gradient descent) share the 
same algebraic structure and suffer from the same problem. 

1 The NoBackTrack algorithm 

1.1 The rank-one trick: an expectation-preserving reduction 

We propose to build an approximation G{t) of G{t) with a more sustainable 
algorithmic cost; G{t) will be random with the property KG{t) = G{t) for all 
t. Then the stochastic gradient (5) based on G{t) will introduce noise, but 
no bias, on the learning of 0: the average change in 9 after a large number 
of time steps will reflect the true gradient direction. (This is true only if the 
noises on G{t) at different times t are sufficiently decorrelated. This is the 
case if the dynamical system (1) is sufficiently ergodic.) Such unbiasedness 
does not hold, for instance, if the gradient estimate is simply projected onto 
the nearest small-rank or diagonal plus small-rank approximation.^ 

The construction of an unbiased G is based on the following “rank-one 
trick”. 

Proposition 1 (Rank-one trick). Given a decomposition of a ma¬ 
trix A as a sum of rank-one outer products, A = ViwJ, and independent 
uniform random signs e* G { — 1,1}, then 

^ ■= (Ei^iVi) (Ej^jWj) (7) 

satisfies 

EA = ^ ViwJ = A (8) 

i 

that is, A is an expectation-preserving rank-one approximation of A. 

Moreover, one can minimize the variance of A by taking advantage of 
additional degrees of freedom in this decomposition, namely, one may first 
replace Vi and Wi with piVi and Wi/pi for any pi G M*. The choice of pi which 
yields minimal variance of A is when the norms of Vi and Wi become equal, 
namely, pi = \/\\wi\\ / ||uj||. 

"^One may use Kalman filtering either on 0 alone or on the pair {9, h). In the first case, 
^ is explicitly needed. In the second case, all the information about how 6 influences 
the current state h{t) is contained in the covariance between 9 and h, which the algorithm 
must maintain, and which is as costly as maintaining G{t) above. 

®We tried such methods first, with less satisfying results. In practice, consecutive 
projections tend to interact badly and reduce too much the older contributions to the 
gradient. 
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The proof of the first statement is immediate. 

The statement about minimizing variance is proven in Appendix A. Min¬ 
imizing variance thanks to pi is quite important in practice, see Section 1.3. 

The rank-one trick also extends to tensors of arbitrary order; this may 
be useful in more complex situations.® 

The rank-one reduction A depends, not only on the value of A, but also 
on the way A is decomposed as a sum of rank-one terms. In the applications 
to recurrent networks below, there is a natural such choice.^ 


We use this reduction operation at each step of the dynamical system, to 
build an approximation G oiG. A key property is that the evolution equation 
(4) satisfied by G is affine, so that if G{t) is an unbiased estimate of G{t), 
then G{t) -|- is an unbiased estimate of G{t -|- 1). 

This leads to the NoBackTrack algorithm (Euclidean version) described 
in Algorithm 1. At each step, this algorithm maintains an approximation of 
G as 

G = VW~^ -|- ^2 (9) 

i 

ri f 

where e* is the i-th basis vector in space h, and Wi := are sparse vectors. 

To understand this structure, say that G{t — 1) = is a rank-one 
unbiased approximation of G{t — 1). Then the evolution equation (4) for G 
yields + M = W approximation of G{t). 

This new approximation is not rank-one any more, but it can be used to 
perform a gradient step on 6, and then reduced to a rank-one approximation 
before the next time step. 

Note that handling ^ is usually cheap: in many situations, only a small 
subset of the parameter 6 directly influences each component hi{t+ 1) given 
h{t), so that for each component i of the state space, ^ has few non-zero 
components. For instance, for a recurrent neural network with activities 


®The most symmetric way to do this is to use complex roots of unity, for instance, Ui^ 

Ui (8) uii = ERe CkWk)^ where each (i is taken independently at 

random among {1, This involves complex numbers but there is no need to 

complexify the original dynamical system (1). Another, complex-free possibility is to apply 
the rank-one trick recursively to tensors of smaller order, for instance, Ui®Vi®Wi®Xi = 


^.(ui ® Vi) ® {wi (g) a;i) = E (^- SiUi g) Vi)(f2 SjWj (g) xj) 


and then apply independent 


rank-one decompositions in turn to W. SiUi (g) vt and to W . SiWi ® Xj. 

'The rank-one trick may also be performed using random Gaussian vectors, namely A = 
E[^(^^S“^A)] with 5 = Mf), S). This version does not depend on a chosen decomposition 
of A, but depends on a choice of E. Variance can be much larger in this case: for instance, 
if A = vw^ is actually rank-one, then (£u)(ew^) = vvA so that the rank-one trick with 
random signs is exact, whereas the Gaussian version yields which is correct 

only in expectation. This case is particularly relevant because we are going to apply a 
reduction at each time step, thus working on objects that stay close to rank-one. The 
generalization to tensors is also more cumbersome in the Gaussian case. 
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ai{t) := sigm(/ij(t)) and evolution equation hi{t + 1) = 6^ + J2k''"ikXki't) + 
J2j Wjiaj(t), the derivative of hi{t + 1) with respect to the parameter 0 = 
(6, r, W) only involves the parameters bi, rik, Wji of unit i. In such situations, 
the total cost of computing and storing all the rcj’s is of the same order as the 
cost of computing h{t + 1) itself. See Section 1.3 for details on this example. 


After the reduction step of Algorithm 1, w may be interpreted as a “search 
direction” in parameter space 9, while v is an estimate of the effect on the 
current state h(t) of changing 9 in the direction w. The search direction w 
evolves stochastically, but not fully at random, over time, so that on average 
vu}^ is a fair estimate of the actual influence of the parameter 9. 

Note that in Algorithm 1, the non-recurrent output parameters (p are 
trained according to their exact gradient. The rank-one trick is used only 
for the recurrent part of the system. 


By construction, at each step of Algorithm 1, the quantity Gt '■= hrD"''-|- 
ejtcj satishes EG* = However, since the value of 9 changes along 

the algorithm, we must be careful about the meaning of this statement. 
Intuitively, this derivative with respect to 9 is taken along the actual trajectory 
of parameters 9t realized by the algorithm. 

More formally, let 6 = (0O) ■ ■ ■ ,9t,...) be any sequence of parameters. 
Let / be any function depending on this sequence 0, such as the state of the 
system at time t (all functions considered below will depend only on a finite 
initial segment of 6). Define 6 + e ■= {9 q + e,... ,9t + e,...) and say that 
/ has derivative ^ with respect to 6 if f{0 -|- e) = f{6) + + O(e^) for 

small e. 

Thanks to this convention, the evolution equation (4) for the evolution 
of G{t) holds for any sequence of parameters 6, with G{t) dehned as 
The following statement is then easily proved by induction. 


Proposition 2 (Unbiased rank-one gradient estimate for 

DYNAMICAL SYSTEMS). At each ti m e step t, the quantity Gt ■= vvj^ + 
GiwJ from Algorithm 1 is an unbiased estimate of the gradient of the state 
of the system with respect to the parameter: 


EGi 


dh{t) 

36 


where 6 is the sequence of parameters produced by the algorithm. 


(19) 


In particular, for learning rates r] tending to 0, the parameter evolves 
slowly so that the derivative is close to a derivative with respect to the 
current value 9t of the parameter. Thus, in this regime, tends to 
and since Gt is an unbiased estimate of Gt, the situation gets closer and 
closer to an ordinary stochastic gradient descent if rj is small. Presumably 
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Parameters: h(0) (initial state), 9 q, ipo (initial value of the internal 
and output parameters), rjt (learning rate seheme); 

Data: x{t) (input signal), y{t) (output signal); 

Maintains: h{t) (eurrent state), 6, (p (internal and output 
parameters), v (column vector of size dim h), w (column vector of size 
divaO), Wi (sparse column vectors of size dim 6) for i = 1,... ,dim/i. 


Initialization: 0 9q, p ^ po, u 0, w 0, Wj 0; 

for t = 0 to end-of-time do 

Observation step: Compute prediction y(t) = Y{h{t),p) from 
current state h{t). 

Observe y{f) and incur loss it{y{t),y{t)). 

Update step: Compute derivative of loss with respect to output 
parameters, ^ ^ update output parameters: 


dit^ 


( 10 ) 


Compute derivative of loss with respect to current state. 


H ^ 


dit {Y{h{t),p),y{f)) 
dh 


Update internal parameters 9: 


( 11 ) 


9 ^9 -r]t {Hv)w - rjt ( 12 ) 

(this is a gradient step 9 <— 9 — rit{HGY using the current gradient 
estimate G from (9)). 

Reduction step: Draw independent uniform random signs 
Ei = ±1. Let Ci be the z-th basis vector in state space. Compute 
P Pi := \/||tCi|| / ||ej|| for each i. Update 

v ^ pv + J^YiPiei 
w wIp + J^YiWilpi 

Wi 0 

Transition step: Observe new value of input signal x{t) and 
compute next state h{t + 1) = f{h(t),x{t),9). Update estimate G: 

- ^ df{h{t),x{t),9) _ 

^ dh 

. ^ dfi{h{t),x{t),9) ^ 


t i — t -j- 1 

end 

Algorithm 1: NoBackTrack algorithm. Euclidean version. 
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(16) 

(17) 

(18) 


(13) 

(14) 

(15) 









this happens whenever the learning rate is small enough for 0 not to change 
too much within a time range corresponding to a “forgetting time” of the 
dynamical system, although more work is needed here. 

1.2 Feeding the gradient estimate to an extended Kalman 
filter 

The Euclidean version of the NoBackTrack algorithm presented in Algorithm 1 
is not enough to obtain good performance fast. Online estimation often 
yields best results when using filters from the Kalman family. We refer 
to [Hay04, Jae02] for a discussion of Kalman filtering applied to recurrent 
neural networks. 

Kalman-based approaches rely on a covariance matrix estimate P{t) on 
9. After observing y(t), the parameter 6 gets adjusted via® 

( 20 ) 

where the derivative of the loss with respect to 9 is computed, as above, via 
the product of the derivative of the loss with respect to the current state 
h(t), and the derivative G{t) = 

Maintaining a full covariance matrix on 9 is usually too costly. However, 
having a good approximation of P{t) is not as critical as having a good 
approximation of Indeed, given an unbiased approximation of any 
symmetric positive definite matrix P{t) which changes slowly enough in time 
will yield an unbiased trajectory for 9. 

Thus, we will use more aggressive matrix reduction techniques on P{t), 
such as block-diagonal (as in [Hay04]) or quasi-diagonal [01115a] approxima¬ 
tions. In our setting, the main point of using the covariance matrix is to 
get both a sensible scaling of the learning rate for each component of 0, and 
reparametrization-invariance properties [01115a]. 

In Kalman filtering, in the case when the “true” underlying parameter 
9 in the extended Kalman filter is constant, it is better to work with the 
inverse covariance matrix J{t) := P{t)~^, and the extended Kalman filter on 
9 can be rewritten as 

J(t)^ J(t-l) + ^^/t^ (21) 

9^9-J{t)-^^ (22) 

where yt is the prediction at time t, where both ^ and ^ can be computed 
from h{t) via the chain rule if G{t) = is known, and where It is the 

^Indeed, in standard Kalman filter notation, one has KtR = PtHj, so that for the 
quadratic loss £ = |(y — yYR~^{y — y) (log-loss of a Gaussian model with coraviance 
matrix R), the Kalman update for 9 is equivalent to 9 ^ 9 — P{t) 
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Fisher information matrix of yt as a probability distribution on y^. (For 
exponential families this is just the Hessian of the loss with respect 

to the prediction). This is the so-called information filter, because J{t) 
approximates the Fisher information matrix on 9 given the observations up 
to time t. This is basically a natural gradient descent on 9. 

This approach is summarized in Algorithm 2, which we describe more 
loosely since matrix approximation schemes may depend on the application. 

Algorithm 2 uses a decay factor (1 — yt) on the inverse covariance matrices 
to limit the influence of old computations made with outdated values of 9. 
The factor yt also controls the effective learning rate of the algorithm, since, 
in line with Kalman hltering, we have not included a learning rate for the 
update of 9 (namely, rjt = 1): the step size is adapted via the magnitude of 
J. For yt = 0, J grows linearly so that step size is 0{l/t). 

Moreover, we have included a regularization term A for matrix inversion; 
in the Bayesian interpretation of Kalman filtering this corresponds to having 
a Gaussian prior on the parameters with inverse covariance matrix A. This 
is important to avoid fast divergence in the very first steps. 

In practice we have used yt = 0{l/y/t) and A = (dimh).Id. 

The simplest and fastest way to approximate the Fisher matrix in Algo¬ 
rithm 2 is the outer product approximation (see discussion in [01115a]), which 
we have used in the experiments below. Namely, we simply use fi ^ 
so that the updates to J^p and Jq simplify and become rank-one outer product 
updates using the gradient of the loss, namely, Jg ^ (1 — yt) Jg -f ^ and 
likewise for (p. Here the derivative ^ is estimated from the current gradient 
estimate G. 

For the matrix reductions, we have used a block-wise quasi-diagonal 
reduction as in [01115a]. This makes the cost of handling the various matrices 
linear in the number of parameters. 

1.3 Examples 

Let us show how Algorithm 1 works out on explicit examples. 

The importance of norm rescaling. Let us hrst consider a simple 
dynamical system which illustrates the importance of rescaling the norms by 
p and Pi- Let 0 < a < 1 and consider the system 

h{t -f 1) = (1 — a)h{t) + 9 (28) 

with both h and 9 in M”. This quickly converges towards 9/a. We have 
df/dh = (1 — a) Id and df/d9 = Id and so dfijd^ = Cj, the i-th basis 
vector. Then the reduction and transition steps in Algorithm 1, if the scalings 
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Parameters: h(0) (initial state), 9 q, (pQ (initial value of the 
parameters), 0 ^ 7 t < 1 (covariance decay parameter scheme), and 
Ag (inverse covariance matrix of the prior on the parameters); 
Maintains: Same as Algorithm 1, plus a representation of matrices 
Jg and allowing for efficient inversion; 

Subroutines: A matrix reduction method MatrixReduce(M) which 
only evaluates a small number of entries of its argument M and 
returns an approximation of M that can be inverted efficiently; 

A routine FisherApprox(yi, which returns either a positive definite 
approximation of the Fisher information matrix of yt as a probability 
distribution on yt, or a positive definite approximation of the Hessian 
respect to the prediction. 


Initialization: as in Algorithm 1, and Jg •(— 0, 0; 

for t = 0 to end-of-time do 

Observation step: as in Algorithm 1. 

Update step: Compute approximate Fisher information matrix 
w.r.t. yt: 

It FisherApprox(yt, yt) (23) 

Compute derivative of prediction and of loss with respect to output 
parameters, ^ and Update inverse covariance matrix of 
output parameters p: 


Jip {1 — 'yt) Jif + MatrixReduce 


/ dyt \ 
[dp ^dpj 


and update output parameters: 


p^p 


{Jip + A,^) ^ 


dp 


(24) 


(25) 


Compute derivative ^ of prediction with respect to current state 
h{t). Update inverse covariance matrix of internal parameters 6: 


Jg {1 — 'yt)Jg + MatrixReduce 


G 


yT 


dyt 

dh 


T 

It 



(26) 


and update internal parameters 9: 

9^9-{Jg + Ag)-^69 (27) 

where 59 := {Hv)w — is the update of 9 from Algorithm 1. 

Reduction step: Same as in Algorithm 1, but the norms used to 
compute p and pi are derived from J^^ (cf. Appendix B). 
Transition step: Same as in Algorithm 1. 
end 

Algorithm 2: NoBackTrack algorithm, Kalman version. 



p are not used, amount to 


(29) 

(30) 


vt+i = (1 - a) {vt + 
wt+i = wt + 

with the ei{t) independent at each step t. The resulting estimate of dh{t)/d9 
is unbiased, but its variance grows linearly with time. Indeed, the dynamics 
of Vt is stationary thanks to the factor (1 — a), but the dynamics of wt is 
purely additive so that wt is just a d-dimensional random walk. On the other 
hand, if rescaling by p is used, then both v and w get rescaled by \/l — a at 
each step,® so that their dynamics becomes stationary and variance does not 
grow. 


Recurrent neural networks. The next example is a standard recurrent 
neural network (RNN). The state of the system is the set of pre-activation 
values /ij(t), and the activities are ai{t) := a{hi{t)) where a is some activation 
function such as tanh or sigmoid. The recurrent dynamics of h is 

hi{t -h 1) = ^ Wji cr{hj{t)) + riiXi{t) (31) 

I 


in which h{t), h{t + l) G M”, {Wji)j^i are a set of weights defining a graph on 
n nodes, and are the input weights.^® The parameter is d = (W,r). 

We hereby omit the output part of the network,as it is of no use to analyze 
the estimation of dh{t)/d0. 

(We have chosen the pre-activation values h, rather than the activities 
a = cr{h), as the state of the system. This results in simpler expressions, 
especially for the input weights r.) 

Thus, the function / defining the dynamical system for the variable h is 
(31). The derivatives of / are immediately computed as dfi/dWji = cr{hj), 
dfi/drii = xi, dfi/dhj = Wjia'{hj), and all other derivatives are 0. 

Algorithm 1 maintains, after the reduction step, an approximation ~ 
v{t)w{ty. We can decompose w{t) = {W{t),r{t)) into the components 
corresponding to the internal and input weights of the parameter 8 = (IT, r), 
so that 


dhi{t) 

dWkj 

dhi{t) 

drij 


Vi{t)Wkj{t) 




(32) 

(33) 


®Proof: By induction one has v = w after the reduction step and v = [1 — a)w after 
the transition step, and p = \/l ~ ot. 

^°Biases are omitted; they can be treated by the inclusion of an always-activated united 
io with aio(t) = 1. 

^'^The experiments below use a softmax output with output parameters ip, see Section 2. 
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By plugging the values of the partial derivatives of / into Algorithm 1, 
we find the following update equations for the value oiv^W and r right after 
the reduction step: 


Vi{t + 1) 


Wkj{t + 0 ) 




+eipi 

j^i 

Wkj{t) ^ ^ (j{hk{t)) 

_ ~T Sj 

P Pj 

rij{t) , _ xi{t) 

- ' ^ . 


(34) 

(35) 

(36) 


where the ej are independent symmetric binary random variables, taking 
values ±1 with probability Any non-zero choice of pj leads to an unbiased 
estimation, though the values are to be optimized as mentioned above. 

Applying this update has the same algorithmic cost as implementing one 
step (31) of the recurrent network itself. 


Leaky recurrent neural networks. To capture long-term dependencies, 
in the experiments below we also use a leaky RNN, obtained via the addition 
of a direct feedback term: 

hi{t+ 1) = aihi{t)+^riiXi{t)+^Wjiaj{t) aj{t) := a{hj{t)) (37) 
I j 

with ai £ [0; 1] for all i. (See [Jae02] for similar models.) This feedback 
term reduces the impact of the vanishing gradient issue and keeps a longer 
memory of past inputs. 

This only changes the derivative of /* with respect to hj, which becomes 
dfildhj = Wjia'ihj) + aAj. Consequently the update rules (35)-(36) for 
W and r are unchanged, while the update of v becomes 

Vi{t -h 1) = paiVi{t) + p'^Wjia'{hj{t)) Vj{t) -h EiPi (38) 

j^i 


Multilayer recurrent neural networks. Let us now treat the case of a 
multilayer recurrent neural network with dynamics 

-h 1) = f^^\x{t), h^^\t),ei) (39) 

/i(2)(t + 1) = f(^\x{t),h^^\t + 1), h(2)(t), 02) (40) 

: (41) 

/iW(t + 1) = /W(x(t), + 1), h^^\t),en) (42) 

where each layer and define an RNN as in (31) above. Directly 
applying the rank-one approximation to the function / = ..., /^”^) 
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would be cumbersome: since the activity of a neuron of the i-th layer at time 
t + 1 depends on all parameters from the previous i — 1 layers, the derivative 
df Idd is not sparse. 

To cope with this, a natural approach is to treat the dynamics in a 
“rolling” fashion and apply the rank-one approximation at each layer in turn. 
Formally, this amounts to defining the following model 

h^^\nt + l) = f^^\x{nt),h^^\nt),9i) (43) 

-I- 2) = -|- 1), + 1), 6 * 2 ) (44) 

: (45) 

+ n) = f^'^\x{nt + n — 1), h^'^\nt + n — l),9n) (4:6) 

with x{t) := x{[t/n\), and where states not explicitly appearing in these 

equations stay unchanged + j) = + j — 1) for i ^ j). Thus, 

the transition function explicitly depends on time (more precisely, on time 
modulo the number of layers), and is sparse at each step. Indeed, at each 
step, applying the transition function amounts to applying one of the to 
the corresponding layer, and leaving the other layers unchanged. Thus the 
derivative of with respect to any 9j, j 7 ^ i, is zero; this leaves only the 
gradient of wrt 9i to be dealt with, and Algorithm 1 or 2 can be applied 
at little cost. 

1.4 Extensions 

Rank-X reductions. A first obvious extension is to use higher-rank re¬ 
ductions. The simplest way to achieve this is to take several independent 
random rank-one VkVjl reductions in (7) and average them. Note that Wi 
(Algorithm 1) has to be evaluated only once in this case. It might be slightly 
more efficient to first split the parameter components into K blocks (e.g., at 
random) so that the /c-th term wj. only involves parameters from the fe-th 
block: indeed, applying the evolution equation for G preserves this structure 
so this requires less memory for storage of the Wk- 

Algorithms similar to RTRL. Other algorithms have been proposed 
that have the same structure and shortcomings as real-time recurrent learning, 
for instance, the online EM algorithm for hidden Markov models from [Capll]. 
In principle, the approach presented here can be extended to such settings. 

Continuous-time systems. Another extension concerns continuous-time 
dynamical systems 

^ = F{h{t),x{t),e) (47) 
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which can be discretized as h{t + 5t) = h{t) + 6tF{h{t), x{t), 9). Thus this is 
analogous to the discrete-time case via / = Id-|-(5tT, and Algorithm 1 may 
be applied to this discretization. 

When performing the rank-one reduction (7), the scaling by pi = \/||rci|| / \\vi 
is important in this case: it ensures that both v and tD change by 0{y/M) 
times a random quantity at each step. This is the expected correct scaling 
for a continuous-time stochastic evolution equation, corresponding to the 
increment of a Wiener process during a time interval 5t. (Without scaling 
by Pi, there will be no well-defined limit as 6t —)• 0, because v would change 
by 0(1) at each step t t+ 6t, while w would evolve by 6t times a centered 
random quantity so that it would be constant in the limit.) Further work is 
needed to study this continuous-time limit. 

2 Experiments 

We report here a series of small-scale experiments on text prediction tasks. 
The experiments focus on two questions: First, does learning using the 
rank-one approximation G accurately reflect learning based on the actual 
gradient G computed exactly via RTRL, or is the noise introduced in this 
method detrimental to learning? Second, how does this approach compare 
to truncated backpropagation through time? 

We used the RNN or leaky RNN models described above to predict a 
sequence of characters y{t) in a finite alphabet A, given the past observations 
x(s) = y{s) for 1 ^ s ^ t — 1. At each time, the network outputs a 
probability distribution on the next character z; explicitly, the output at 
time t is y{t) £ defined by 

y{t)z := Lpizai{t) (48) 

i 

for each z G A, with parameters ip = {pz,piz)i,z- The output y = {yy)y^A 
defines a probability distribution on A via a softmax Py{y) '■= 

and the loss function is the log-loss on prediction of the next character, 
it ■= ~ log 2 P^(t)(y(t)). The internal and output parameters 9 and p are 
trained according to Algorithms 1 and 2. 

We used three datasets. The first is a “text” representing synthetic 
music notation with several syntactic, rhythmic and harmonic constraints 
(Example 3 from [01115b]). The data was a file of length ~ 10^ characters, 
after which the signal cycled over the same file. The second dataset is the 
classical example, synthesized by repeatedly picking an integer n at 
random in some interval, then outputting a series of n a’s followed by a line 
break, then n 6’s and another line break. This model tests the ability of 
a learning algorithm to learn precise timing and time dependencies. The 
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third example is the full set of Shakespeare’s works, obtained from Project 
Gutenberg.The file is roughly 5.10® characters long. 

The benchmarks included are gzip, a standard non-online compression 
algorithm, and context tree weighting (CTW) [BEYY04], a more advanced 
online text compression algorithm, as well as the actual entropy rate of the 
generative model for synthetic music and a^U^. 

The code used in the experiments is available at http: //www. yann-ollivier. 
org/rech/code/nobacktrack/code_nobacktrack_exp.tar.gz 

Euclidean NoBackTrack. We first study whether the low rank approxi¬ 
mation in the Euclidean version of NoBackTrack impacts the gradient descent. 

For this first set of experiments, we use a fully connected RNN with 20 units, 
as described above, on the synthetic music example. We compared RTRL, 
Euclidean rank-one NoBackTrack, and Euclidean NoBackTrack using rank- 
two and rank-ten reductions (obtained by averaging two or ten independent 
rank-one reductions, as discussion in Section 1.4). 

The results are summed up in Figure 1 and Figure 2. All the models were 
trained using the same learning rate rjt = 1/Vi for Figure 1 and rjt = 0.03/-v/t 
for Figure 2. 

The various algorithms were run for the same amount of time. This is 
reflected in the different curve lengths for the different algorithms; in partic¬ 
ular, the curve for RTRL is much shorter, reflecting its higher computational 
cost. (Note the log scale on the t axis: RTRL is roughly 20 times slower with 
20 units.) 

The impact of stochasticity of the low-rank approximation when using 
large learning rates is highlighted on Figure 1: Euclidean NoBackTrack with 
a large learning rate displays instabilities, even when increasing the rank of 
the approximation. 

Smaller learning rates allow the algorithm to cope with this, as the noise 
in the gradients is averaged out over longer time spans. This is illustrated in 
Figure 2, in which the trajectories of Euclidean NoBackTrack track those of 
RTRL closely even with a rank-two approximation. 

Kalman NoBackTrack. Next, we report the results of the Kalman ver¬ 
sion of NoBackTrack on the same experimental setup. A quasi-diagonal 
outer product (QDOP) approximation [01115a] of the full Kalman inverse 
covariance matrix is used, to keep complexity low. 

We compare the low-rank approximations to RTRL. To make the com¬ 
parison clear, for RTRL we also use a quasi-diagonal (QDOP) approximation 
of the Kalman Altering algorithm on top of the exact gradient computed by 
RTRL. 

^^www. gutenberg.org 
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Number of characters read 


Figure 1: Average log-loss (bits per character) on synthetic music as a 
function of the number of characters read, for an RNN with 20 units trained 
with the Euclidean version of the NoBackTrack algorithm for different rank 
values and RTRL, with learning rate r]t = benchmarked against the 

true model entropy rate, gzip, and CTW. 


Learning rates were set to 1 and all algorithms were run for the same 
amount of time. 

The use of the QDOP-approximated Kalman inverse covariance appears 
to fully fix the unstable behaviour. Overall, low-rank approximations appear 
to be roughly on par with QDOP RTRL. There is no obvious gain, on this 
particular example, in using higher-rank approximations. 

Still, on this particular task and with this particular network size, none 
of the RNN algorithms (including BPTT reported below) match the perfor¬ 
mance of Context Tree Weighting. RNNs beat CTW on this task if trained 
using a non-online, Riemannian gradient descent [01115b] (analogous to using 
the Kalman inverse covariance). So this is arguably an effect of imperfect 
online RNN training. 

Kalman NoBackTrack and truncated BPTT. Our next set of exper¬ 
iments aims at comparing Kalman NoBackTrack to truncated BPTT, with 
truncation^^ parameter T = 15. As BPTT truncates the full gradient by 

the version of BPTT used here, the algorithm does not backtrack by T steps at 
every time step t; rather, it waits for T steps between t and t + T, then backtracks by T 
steps and collects all gradients in this interval. Otherwise, truncated BPTT would be T 
times slower, which was unacceptable for our experiments. 
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Number of characters read 


Figure 2: Average log-loss (bits per character) on synthetic music as a 
function of the number of characters read, for an RNN with 20 units trained 
with the Euclidean version of the NoBackTrack algorithm for different rank 
values, and with RTRL, with learning rate rjt = 0.03/\/t, benchmarked 
against the true model entropy rate, gzip, and CTW. 


removing dependencies at distances longer than the truncation parameter, we 
expect Kalman NoBackTrack to learn better models on datasets presenting 
long term correlations. 

The two algorithms are first compared on the synthetic music dataset, 
with the same experimental setup as above, for the same amount of time, 
with a learning rate r]t = 1/Vi for truncated BPTT and = 1/v^ for 
Kalman NoBackTrack.^'^ The results are shown in Figure 4. 

On this example, truncated BPTT perfoms better than Kalman NoBack¬ 
Track, even though the two algorithms display broadly comparable perfor¬ 
mance. Noticeably, RTRL and truncated BPTT are roughly on par here, 
with truncated BPTT slightly outperforming RTRL in the end: apparently, 
maintaining long term dependencies in gradient calculations does not improve 
learning in this synthetic music example. 

Next, to compare NoBackTrack and truncated BPTT on their specihc 
ability to learn precise middle and long term dependencies, we present 
experiments on the example. This will clearly illustrate the biased 
nature of the gradients computed by truncated BPTT. 

^'^These learning rates have different meanings for Kalman NoBackTrack and truncated 
BPTT, and are not directly comparable. 
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Number of characters read 


Figure 3: Average log-loss (bits per character) on synthetic music as a 
function of the number of characters read, for an RNN with 20 units trained 
with the Kalman/QDOP version of the NoBackTrack algorithm for different 
rank values and Kalman/QDOP RTRL, benchmarked against the true model 
entropy rate, gzip, and CTW. 


The dataset is synthesized by sequentially picking a number n 

between k and I uniformly at random, then outputting a series of n a’s 
followed by a line break, then n 6’s and another line break. The true entropy 
rate is in this example. A roughly 10® character long input 

sequence was synthesized, using [fc; Z] = [1;32]. 

As standard RNN models do not seem to be able to deal with this example, 
whatever the training algorithm, we used a leaky RNN^® as presented in 
Section 1.3, again with 20 fully connected units. All the algorithms used 
a learning rate of 1/y/i. The results are reported on Figure 5, which also 
includes the entropy rate of the exact a"'b'^ model and the (twice larger) 
entropy rate of an aPW model with independent n and p. 

Kalman NoBackTrack clearly outperforms truncated BPTT on this 
dataset. This was to be expected, as the typical time range of the tem¬ 
poral dependencies exceeds the truncation range for BPTT, so that the 

^®Indeed, logjlZ — fc -I- 1) bits are needed to encode the value of n in each block 
(this is the entropy of a uniform distribution on {k,..., 1}), and the average value of n is 
(fc -I- l)/2 so that the average length of an block, including the two newline symbols, 
is 2 X (1 -I- A:)/2 -f 2. 

^®The parameter a of the LRNN can be learned, but this sometimes produces numerical 
instabilities unless cumbersome changes of variables are introduced. We just initialized a 
to a random value separately for each unit and kept it fixed. 
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2.5 


— Truncated BPTT 

— RTRL 



Figure 4: Average log-loss (bits per character) on synthetic music as a 
function of the number of characters read, for an RNN with 20 units trained 
with the Kalman/QDOP version of the NoBackTrack algorithm for different 
rank values, Euclidean RTRL, and truncated BPTT, benchmarked against 
the true model entropy rate, gzip, and CTW. 



Figure 5: Average log-loss (bits per character) on the 32 ] dataset, as a 
function of the number of characters read, for a leaky RNN with 20 units, 
trained with a Kalman/QDOP version of the NoBackTrack algorithm, RTRL, 
and BPTT. 
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Number of characters read 


Figure 6: Average log-loss (bits per character) on Shakespeare’s works as a 
function of the number of characters read, for an RNN with 20 units trained 
with the QDOP version of the NoBackTrack algorithm for different rank 
values, Euclidean RTRL and truncated BPTT, benchmarked against gzip 
and CTW. 

approximated gradients computed by truncated BPTT are signihcantly 
biased. 

Keeping track of the long term dependencies is key here, and RTRL 
outperforms all the algorithms epochwise, though it is still penalized by its 
high complexity. Truncated BPTT is unable to learn the full dependencies 
between a’s and 5’s, and ends up closer to the entropy of an a^lf model with 
independent values of n and p (presumably, it still manages to learn the 
a"'6"' blocks where n is short). At some point the learning curve of truncated 
BPTT appears not to decrease anymore and even goes slightly up, which is 
consistent with a biased gradient estimate. 

On the other hand, Kalman NoBackTrack seems to be mostly successful 
in learning the dependencies. This is conhrmed by visual inspection of the 
output of the learned model. The small remaining gap between the true 
model and the learned model could be related to incomplete training, or to 
an imperfect modelling of the exact uniform law for n G [k',l]. 

Finally, we report performance of truncated BPTT and Kalman NoBack¬ 
Track on Shakespeare’s works. The same 20-unit RNN model is used, again 
with all algorithms run for the same amount of time using the same learning 
rate IjVi- The curves obtained are displayed in Figure 6. 

On this example, RTRL, truncated BPTT, and Kalman NoBackTrack 
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with various ranks all have a similar performance; it is not clear whether 
the differences on Figure 6 are statistically significant. This proves, once 
more, that the stochasticity and rank reduction inherent to NoBackTrack 
are not detrimental to learning, and allow it to keep up with exact gradient 
algorithms. 

All RNN algorithms have a significantly worse performance than CTW 
on this example, thus proving that a 20-unit RNN does not accurately model 
Shakespeare’s works. 


Conclusion. We have introduced an algorithm that computes a stochastic, 
provably unbiased estimate of the derivative of the current state of a dy¬ 
namical system with respect to its parameters, in a fully online fashion. For 
recurrent neural networks, the computational cost of this algorithm is com¬ 
parable to that of running the network itself. Previously known algorithms 
were either not fully online or had a significantly higher computational cost. 

In our experiments, this algorithm appears as a practical alternative to 
truncated backpropagation through time, especially in its Kalman version, 
while the Euclidean version requires smaller learning rates. The (unbiased) 
noise and rank reduction introduced in the gradient approximation do not 
appear to prevent learning. The interest of NoBackTrack with respect to 
truncated BPTT depends on the situation at hand, especially on the scale of 
time dependencies in the data (which results in biased gradient estimates for 
BPTT), and on whether the storage of past states and past data required by 
truncated BPTT is acceptable or not. 

Acknowledgments. The authors would like to thank Hugo Larochelle for 
his helpful questions that resulted in several clarifications of the text. 

A Variance of the rank-one trick 

Keep the notation of Proposition 1 and let H-lj be a Euclidean norm on the vector 
space in which the Vi and Wi live. 

To measure the variance of A we use the Hilbert-Schmidt norm ||A||yg := 

Tr(A^A). This norm satisfies 11 | jj 3 = Ill'll II'U'IL and ('CiU’l | n 2 W^)jjg = {vi | 'C 2 ) (ici \ W 2 ) 

for the associated scalar product. 

Let us evaluate the variance of A in this norm. Since Var A = E A EA 

II II rib II II Hb 

~ 11 ~ 11 2 
and EA = A is fixed, it is enough to evaluate the second moment E ||A||jjg. 

We claim that 

+ \vj) {wi\ Wj) (49) 

i j i j^i 
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Indeed, A = J^ij SiSjViW^ so, by bilinearity of the Hilbert-Schmidt scalar product, 
E||^||hs = M)hs ^ E'^e^ejekei {vi \vj) {wk \ wi) (50) 

ijkl 

Since Ee^ = 0 and E(ei£j) = 0 for i ^ j, the only cases to consider are: 

1. i = 2 and k = I and i ^ k\ contribution ll'^fcll^ 

2. i = k and j = I and i ^ j: contribution \'^i) {"^i I ) 

3. i = I and j = k and i ^ j'. same contribution as the previous one 

4. i = j = k = 1: contribution ||r^i||^ llw^ill^ 

5. all other cases contribute 0. 

The first and fourth contributions add up to ll^fell^)- This proves 

(49). 

Let us minimize variance over the degrees of freedom given by ViwJ = {piVi){wi/piY. 

II "112 

Pi does not change the last contribution to E || A||in (49), neither does it change 
the expectation EA = A, so to minimize the variance we only have to minimize the 
first term (^- ||fi||^)(X]fc ll^fcll^)- Applying the scaling, this term becomes 

(E 11^*11'(51) 

i k 


and, by differentiation with respect to a single pi, one checks that this is minimal 
for 

p^ oc x/ikEnNl (52) 

(mutliplying all pi’s by a common factor does not change the result). So, after 
optimal scaling. 


A = 




£iV, 




E^iw^ivlh 


(53) 


Consequently, after scaling, the first term in the variance of A in (49) becomes 
(Si Ikill The second term in (49) does not change. 

Thus, after optimal scaling we find 

^ (e 11^*11 +2EE I I (^"^) 

\ i / i j=^i 


To obtain the variance of A, we just subtract the square norm of EA = A, which is 


WMm = 




HS 


E II^^'^TIIhs + E E (I (^^) 

i i 


(by bilinearity of the Hilbert-Schmidt scalar product) 


= E 11^*11^ ikif+EE \vj}{wY Wj) 

i i j=jti 


(56) 


21 









This yields, after optimal scaling, 

Var A = (vi I Vj) (Wi I w^) (57) 

Vi / i i j^i 

= XI11^*11 + (^* I I ^j) (58) 

i j=/=i 

B Invariant norms derived from the Kalman co- 
variance 


Algorithm 2 is built to offer invariance properties (a Kalman filter over a variable 9 is 
invariant by affine reparameterization of 9, for instance). However, this only holds if 
the norms ||i;||, ||u;||, ||ui||, ||wi||, used to compute the scaling factors p = \/iM[71iW 
and Pi = ^/llwill / ||ei||, are themselves reparameterization-invariant. 

This can be achieved if we decide to choose the scalings p as to minimize the 
variance of G computed in the (Mahalanobis) norm defined by the covariance matrix 
of 9 and of h appearing in the Kalman filter. 

Let Cg be the covariance matrix of 9 obtained in the Kalman filter; in Algorithm 2, 
Cg is approximated by Cg « Jg^- 

Any linear form on 9, such as w and Wi, can be given a norm by 

llu'lj^ := w^Cgw « Jg^w (59) 


and likewise for Wi. This norm is invariant under 0-reparameterization. 

Given the covariance Cg of 9 and the dependency G = ^ oi h with respect to 
0, the covariance of h is 

Ch ■■= GCgG^ (60) 

and its inverse Ju '■= can be used to define a norm for a tangent vector v at 
state h via 

\\vf:=v^JhV (61) 

which is also reparametrization-invariant. (We use Jg^ for the norm of w and Jh 
for the norm of v because r; is a tangent vector (covariant) at point h, while w is a 
linear form (contravariant) at point 0.) 

However, handling of full covariance matrices would be too costly. In Algorithm 2, 
the inverse covariance Jg of 0 is already an approximation (diagonal, quasi-diagonal...) 
via MatrixReduce. Moreover, here we only have access to an approximation G of 
G. Thus, we simply replace G with G in the definition of Ch, and use a diagonal 
reduction. This leads to Ch « Diag(GJ^^G'^) and 

(Diag(GJ,-iG^))”' (62) 

where as usual G is the gradient approximation given by (9). 

The diagonal reduction is necessary if G is low-rank, since GJ^^G^ will be 
low-rank as well, and thus non-invertible. 

Then the scaling factors p and pi can finally be computed as 


P = 



{W^Jg 






1/4 


(63) 
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and 


Pr = 





(64) 


The particular structure of Jg (if approximated by, e.g., a block-diagonal matrix) 
and of G = vv)^ + GwJ make these computations efficient. 

Note that even with the approximations above, G is still an unbiased estimate 
of G. Indeed, any choice of p has this property; we are simply approximating the 
optimal p which minimizes the variance of G. 

In practice, small regularization terms are included in the denominator of every 
division and inversion to avoid numerical overflow. 
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